LS4003 R tutorial 3
T-tests in R
In this tutorial we’re going to use R to calculate our T-test results for the different examples given in the Statistics lecture.
See example below for the end point:
Make sure you’ve completed the Tutorial 1 section on using R from Excel before starting here.
Install and Set-up
A refresher for how to install and set up R and RStudio.
To get set up, follow the steps below for either Posit Cloud or your own machine.
Posit Cloud is an online, cloud-based option. It’s a bit more limited than running R on a university computer or your own computer, but the free option should be enough for this module.
Go to Posit Cloud and create a free account
Log in, then go to New Project -> New RStudio Project.
Make a new folder in the bottom right panel (by clicking the New Folder button) called “LS4003_Statistics”.
Click on this folder to enter it, and then click the More cog (bottom right panel) and select “Set as Working Directory”.
To run R on your own machine, you have to install R (the programming language) and RStudio (the development environment).
When installing, click the most appropriate option for your machine (Windows/Mac/Linux)
Once you have installed both, open RStudio.
Navigate to your Documents folder in the bottom right panel. (If you can’t find it, type setwd("~/Documents") into the console on the bottom left, then click the More cog on the bottom right and select “Go to Working Directory”.)
Create a new folder called LS4003_Statistics by clicking the New Folder button on the right hand side.
Click on your folder (LS4003_Statistics) to enter it.
Set that as your final working directory by clicking on the ‘More’ cog icon again and selecting “Set as Working Directory”.
Unpaired t-test
We’re going to be using the same examples as in Statistics lecture 3 for each of our tests.
Add the data into R
The first thing we need to do is copy our data into R. As these are fairly small datasets, we can manually type in our data using the c() function to create a vector of values.
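A minimal sketch of typing values in with c(). The numbers below are made-up placeholders, not the lecture dataset; swap in the 16 values per group from Statistics lecture 3.

```r
# Placeholder values only: replace with the lecture 3 data
NonDiabetic <- c(4.8, 5.1, 5.0, 4.6, 5.3, 4.9, 5.2, 4.7,
                 5.0, 5.4, 4.5, 5.1, 4.9, 5.2, 4.8, 5.0)
Diabetic    <- c(7.2, 8.1, 6.9, 7.5, 8.4, 7.8, 7.0, 8.0,
                 7.6, 6.8, 8.3, 7.4, 7.9, 7.1, 8.2, 7.7)

# Check that each vector has the expected number of values
length(NonDiabetic)
length(Diabetic)
```

Checking the length after typing is a quick way to spot a missed or doubled value.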
Summarise the data
We can use the summary() function to see our mean and interquartile range for each set of values.
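As a small illustration of what summary() reports, here is a sketch on a made-up vector (example_values is a hypothetical name, not part of the lecture data):

```r
# Made-up values for illustration only
example_values <- c(1, 2, 3, 4, 5)

# Reports the minimum, 1st quartile, median, mean, 3rd quartile and maximum
summary(example_values)

# The interquartile range is the 3rd quartile minus the 1st quartile
IQR(example_values)
```

The same two calls work on each of your own vectors.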
Calculate the unpaired t-test p-value
Using the function t.test(), we can add our two sets of values as arguments and calculate the p-value. This should match the result from Excel.
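A sketch of the call, using short placeholder vectors rather than the lecture data (so the p-value here will not match your Excel result):

```r
# Placeholder values only, not the lecture data
NonDiabetic <- c(4.8, 5.1, 5.0, 4.6, 5.3, 4.9)
Diabetic    <- c(7.2, 8.1, 6.9, 7.5, 8.4, 7.8)

# Default settings: two-tailed, unpaired, unequal variances (Welch)
unpaired_t_test_result <- t.test(NonDiabetic, Diabetic)

unpaired_t_test_result          # full printed summary
unpaired_t_test_result$p.value  # just the p-value
```

Saving the result to a variable lets you pull out individual parts, such as the p-value, later on.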
That’s our t-test done already!
We didn’t have to use any options here, as the default test is two-tailed, unpaired, and assumes the variances are not equal. If you want to change any of the options, you can try:
- alternative, for a two-tailed or one-tailed test:
  - "two.sided" for a two-tailed test
  - "greater" for a one-tailed test where the mean of the first variable is greater than the mean of the second variable
  - "less" for a one-tailed test where the mean of the first variable is less than the mean of the second variable
unpaired_t_test_result <- t.test(NonDiabetic, Diabetic, alternative="greater")
- paired, for a paired or unpaired test:
  - TRUE for paired
  - FALSE for unpaired
unpaired_t_test_result <- t.test(NonDiabetic, Diabetic, paired = TRUE)
- mu, the mean value to compare against (used in the one-tailed test later in this tutorial)
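To see how the alternative option changes the result, here is a sketch that runs all three settings on the same placeholder vectors (made-up values, not the lecture data):

```r
# Placeholder values only, not the lecture data
NonDiabetic <- c(4.8, 5.1, 5.0, 4.6, 5.3, 4.9)
Diabetic    <- c(7.2, 8.1, 6.9, 7.5, 8.4, 7.8)

p_two_sided <- t.test(NonDiabetic, Diabetic, alternative = "two.sided")$p.value
p_greater   <- t.test(NonDiabetic, Diabetic, alternative = "greater")$p.value
p_less      <- t.test(NonDiabetic, Diabetic, alternative = "less")$p.value

# NonDiabetic has the lower mean here, so "less" gives the small p-value
c(two_sided = p_two_sided, greater = p_greater, less = p_less)
```

The order of the two vectors matters for a one-tailed test: swapping them swaps the meaning of "greater" and "less".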
Create a dataframe with the data
We can use the following code to create a dataframe with the above vectors.
We need a column that contains the correct group - “NonDiabetic” or “Diabetic” for each value. We can use the rep() function for this which replicates values.
- c("NonDiabetic", "Diabetic") is a list of the items we want to repeat
- each = 16 gives us 16 of each item. We could also put c(16, 16) in its place, which rep() reads as the number of times to repeat each item.
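A short sketch comparing the two ways of calling rep() (group_each and group_times are hypothetical names used only for this illustration):

```r
groups <- c("NonDiabetic", "Diabetic")

# each = 16 repeats every item 16 times in turn
group_each <- rep(groups, each = 16)

# times = c(16, 16) gives a separate count for each item
group_times <- rep(groups, times = c(16, 16))

identical(group_each, group_times)  # both approaches give the same vector
table(group_each)                   # counts per group
```

The times form is the one to reach for when the groups have different sizes.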
Visualise the results as a boxplot
We can now create a boxplot in the same way we did in Tutorial 1. We can also add a t-test result using the function stat_compare_means() from the ggpubr package.
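A rough sketch of such a plot, assuming the ggplot2 and ggpubr packages are installed. The data are simulated placeholders, and the names Glycemia_DataFrame, Value and Group are assumptions that may differ from the ones used in Tutorial 1.

```r
library(ggplot2)
library(ggpubr)

# Simulated placeholder data; object and column names are assumptions
set.seed(1)
Glycemia_DataFrame <- data.frame(
  Value = c(rnorm(16, mean = 5), rnorm(16, mean = 8)),
  Group = rep(c("NonDiabetic", "Diabetic"), each = 16)
)

glycemia_plot <- ggplot(Glycemia_DataFrame, aes(x = Group, y = Value)) +
  geom_boxplot() +
  stat_compare_means(method = "t.test")  # adds the t-test p-value as a label

glycemia_plot
```

Setting method = "t.test" matters here, because stat_compare_means() does not use a t-test unless asked to.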
Dataframe to a list of values
We’ve just turned our two lists into one long dataframe, but what if we want to do the opposite?
If you want to extract the values for a particular group, we can use the following structure:
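A minimal sketch of that structure on a tiny placeholder dataframe. The names Glycemia_DataFrame, Value and Group are assumptions; use whatever your own dataframe and columns are called.

```r
# Placeholder data; object and column names are assumptions
Glycemia_DataFrame <- data.frame(
  Value = c(4.8, 5.1, 5.0, 7.2, 8.1, 6.9),
  Group = rep(c("NonDiabetic", "Diabetic"), each = 3)
)

# dataframe$column[condition] keeps only the values where the condition is TRUE
DiabeticExtracted <- Glycemia_DataFrame$Value[Glycemia_DataFrame$Group == "Diabetic"]
DiabeticExtracted
```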
If you look in your environment, DiabeticExtracted should be identical to your Diabetic values.
Using this, if you are importing data from Excel, you can extract the values into groups to prepare for your t-test.
Paired t-test
Our paired t-test is very similar. First, let’s copy the data from our Basketball players example.
Summarise the data
We can use the summary() function to see our mean and interquartile range for each set of values.
Calculate the paired t-test p-value
Using the function t.test(), we can add our two sets of values as arguments and calculate the p-value. This should match the result from Excel.
We need to specify the option paired = TRUE for it to be a paired t-test.
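A sketch of the paired call, using made-up values. Before and After are hypothetical names standing in for the two sets of Basketball measurements.

```r
# Placeholder values; Before and After are hypothetical names
Before <- c(12, 15, 9, 14, 11, 13)
After  <- c(14, 18, 10, 17, 12, 16)

paired_t_test_result <- t.test(Before, After, paired = TRUE)
paired_t_test_result$p.value

# A paired test gives the same p-value as a one-sample test on the differences
t.test(Before - After, mu = 0)$p.value
```

Because the test works on the difference within each pair, both vectors must be the same length and in the same order.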
Create a dataframe with the data
We can use the following code to create a dataframe with the above vectors.
Fill in the gaps, and if you’re stuck have a look at how we did this for the glycemia dataset.
Visualise the results as a boxplot
We can do a boxplot and add our p-value in the same way as for the unpaired test by adding the option paired=TRUE.
One-tailed t-test
We can follow the same process for the one-tailed t-test. First, we need our list of grades from our example.
Summarise the data
We can use the summary() function to see our mean and interquartile range for each set of values.
Calculate the one tailed t-test p-value
Using the function t.test(), we can add our one set of values and the number we want to compare it to. In this example, we’re looking to see if our mean value is greater than 40.
We need to specify the option alternative = 'greater' to do a one-tailed test of whether the mean of our Grades is significantly above our value for mu, which we have set to 40 (a pass).
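A sketch of this call with made-up grades (not the lecture values); one_tailed_result is a hypothetical name.

```r
# Placeholder grades, not the lecture data
Grades <- c(45, 52, 38, 61, 47, 55, 42, 58)

# Is the mean of Grades greater than 40?
one_tailed_result <- t.test(Grades, mu = 40, alternative = "greater")

one_tailed_result$estimate  # the sample mean
one_tailed_result$p.value   # the one-tailed p-value
```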
Create a dataframe and boxplot
Because we only have one group of data values, we can create our dataframe and draw our boxplot very simply.
We don’t need to annotate our p-value on here - it wouldn’t make much sense to do so as our p-value was specifically assessing if the mean was higher than 40 which is hard to represent visually.
ANOVA
Our final example uses a one-way ANOVA. We’re using the example of salaries and degrees. Note that there’s not the same number of values for each group.
Create a dataframe
To do an ANOVA, we need to organise our data into a dataframe.
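A sketch of how such a dataframe could be built. The salaries are simulated placeholders, "Biology" is a hypothetical third group name, and the order of the group sizes is an assumption.

```r
# Simulated placeholder salaries; "Biology" is a hypothetical group name
set.seed(1)
Degrees_DataFrame <- data.frame(
  salary = round(rnorm(25, mean = 30, sd = 5), 1),
  group  = rep(c("Economics", "History", "Biology"), times = c(9, 7, 9))
)

# Number of values in each group
table(Degrees_DataFrame$group)
```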
Why have we used c(9,7,9) here? What if you change these numbers around?
If you make a change, have a look at your Degrees_DataFrame values and see if they still match our original data.
Calculate the ANOVA p-value
We can calculate the ANOVA p-value by using the aov() function.
We use salary ~ group to assess salary as a function of the degree group.
We always put the response variable before the ~ and the explanatory variable after. Another way of reading salary ~ group would be “Salary depends on degree group”.
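A sketch of the aov() call on simulated placeholder data ("Biology" is a hypothetical third group name, and anova_result and anova_p_value are hypothetical variable names):

```r
# Simulated placeholder salaries, not the lecture data
set.seed(1)
Degrees_DataFrame <- data.frame(
  salary = c(rnorm(9, mean = 35, sd = 4),
             rnorm(7, mean = 28, sd = 4),
             rnorm(9, mean = 30, sd = 4)),
  group  = rep(c("Economics", "History", "Biology"), times = c(9, 7, 9))
)

# Response variable before the ~, explanatory variable after
anova_result <- aov(salary ~ group, data = Degrees_DataFrame)
summary(anova_result)  # ANOVA table, p-value in the Pr(>F) column

# Pull out the p-value for the group term on its own
anova_p_value <- summary(anova_result)[[1]][["Pr(>F)"]][1]
anova_p_value
```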
Pairwise t test results
Our ANOVA was significant - but now we need to know which pairs are significantly different.
We can use pairwise.t.test() to do a t-test for all pairs of groups in our dataset.
We use the options:
- pool.sd = FALSE so that the variance is calculated independently for each group
- p.adjust.method = "none" because this function uses a Holm-Bonferroni correction by default to minimise false positives, which is beyond the scope of this course
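A sketch of the call with these options, on simulated placeholder data ("Biology" is a hypothetical third group name). It also checks one pair against a separate t.test() to show what the options do.

```r
# Simulated placeholder salaries, not the lecture data
set.seed(1)
Degrees_DataFrame <- data.frame(
  salary = c(rnorm(9, mean = 35, sd = 4),
             rnorm(7, mean = 28, sd = 4),
             rnorm(9, mean = 30, sd = 4)),
  group  = rep(c("Economics", "History", "Biology"), times = c(9, 7, 9))
)

pairwise_result <- pairwise.t.test(Degrees_DataFrame$salary,
                                   Degrees_DataFrame$group,
                                   pool.sd = FALSE,
                                   p.adjust.method = "none")
pairwise_result$p.value  # matrix with one p-value per pair of groups

# With these options, each entry matches a separate default t.test()
economics <- Degrees_DataFrame$salary[Degrees_DataFrame$group == "Economics"]
history   <- Degrees_DataFrame$salary[Degrees_DataFrame$group == "History"]
t.test(economics, history)$p.value
```

The matrix is only half filled, because each pair is tested once.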
Visualise the results
We can create a boxplot and plot all of our pairwise comparison p values.
To do this, first we can create a list containing all the pairs we want to plot the p-value for. Remember you only need to do each pair once - if you have c("Economics", "History") you don’t also need c("History", "Economics").
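A sketch of building that list. "Biology" is a hypothetical third group name, and comparison_pairs is a hypothetical variable name; the list is in the form expected by the comparisons argument of stat_compare_means() from ggpubr.

```r
# "Biology" is a hypothetical third group name
degree_groups <- c("Economics", "History", "Biology")

# Written out by hand: each pair appears once
comparison_pairs <- list(
  c("Economics", "History"),
  c("Economics", "Biology"),
  c("History", "Biology")
)

# combn() builds the same list of unique pairs automatically
combn(degree_groups, 2, simplify = FALSE)
```

Typing the pairs by hand is fine for three groups; combn() saves effort when there are more.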
Extension
If you finish the tutorial, go back to Worksheet 1 with the Penguins dataset.
Is there anything in this dataset that would suit a t-test? Which tests would you use?
Go through and see if you can find any significant differences between the three penguin species.